Abstract

One of the primary challenges of developmental robotics is
the question of how to learn and represent increasingly
complex behavior in a self-motivated, open-ended way. Barto,
Singh, and Chentanez (Barto, Singh, & Chentanez 2004; Singh,
Barto, & Chentanez 2004) have recently presented an
algorithm for intrinsically motivated reinforcement learning
that strives to achieve broad competence in an environment
in a task-nonspecific manner by incorporating internal
reward to build a hierarchical collection of skills. This
paper suggests that with its emphasis on task-general,
self-motivated, and hierarchical learning, intrinsically
motivated reinforcement learning is an obvious choice for
organizing behavior in developmental robotics. We present
additional preliminary results from a gridworld abstraction
of a robot environment and advocate a layered learning
architecture for applying the algorithm on a physically
embodied system.

Introduction 

One of the primary challenges of developmental robotics is
the question of how to learn and represent increasingly
complex behavior in a self-motivated, open-ended way. We
argue in this paper that, equipped with recent advances
pertaining to temporal abstraction and hierarchy,
reinforcement learning (RL) provides a promising framework
for learning and representing hierarchical skills. Indeed,
we are presently engaged in ongoing research in
intrinsically motivated reinforcement learning, an approach
introduced by Barto, Singh, & Chentanez (2004) wherein the
primary reinforcement signal is generated within the agent,
allowing it to develop broad competence in an environment in
an open-ended fashion.

However, when applied naively to robotic tasks, RL methods
often struggle with continuous, high-dimensional state and
action spaces and with insufficient learning experience. In
some cases a simpler and more elegant solution is to layer
learning, so that RL takes place not over the raw sensor
space, for instance, but rather over a learned economical
representation of that space which facilitates RL.
Thus, in this paper we:

•  advocate  a  layered  approach  to  learning  architectures 
for  developmental  robotics; 

•  advocate  intrinsically  motivated  RL  (Barto,  Singh, 
&  Chentanez  2004;  Singh,  Barto,  &  Chentanez  2004) 
as  an  especially  promising  approach  to  developmental 
learning;  and 

•  present preliminary results applying intrinsically
motivated RL to a gridworld abstraction of a robot domain.

In the next section we briefly review RL and layered
learning. Next we review a recent success integrating RL and
behavior-based robotics, using a distributed topological map
as an intermediary layer. We then review an algorithm for
intrinsically motivated RL, and present a simple gridworld
experiment illustrating its potential. Finally, we discuss
the benefits of this approach and advocate a layered
architecture for bringing this approach to bear on embodied
systems, as well as other directions for future work.

Background 
Reinforcement  Learning 

Reinforcement learning (Sutton & Barto 1998) aims to solve
the problem of a behaving agent learning to approximate an
optimal behavioral policy through interaction with an
environment. This generally takes the form of learning to
maximize a numerical reward signal over time in a given
environment. This reward signal is the only learning
feedback obtained from the environment, and thus RL falls
somewhere between unsupervised learning (where no signal is
given at all) and supervised learning (where a signal
indicating the correct action is given), which makes it well
suited to developmental robotics.

Most RL algorithms adapt dynamic programming methods to
focus on the most relevant parts of the value space:
behavioral trajectories. State or state-action values are
estimated from experience and “backed up” to compute
approximately optimal policies of actions, that is, those
that maximize expected long-term reward. The Markov decision
process is a popular formalism in RL: at each stage the
agent, in one of the set of possible states, chooses from
the set of available actions an action, which presumably
(stochastically) influences the agent’s subsequent
transition to the next state, receiving a reward in the
process. The policy maps from states and actions to
probabilities of executing a given action in a given state.
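As an illustration of the “backing up” of state-action values
described above, the following is a minimal sketch of a
one-step tabular Q-learning backup. Q-learning is used here
only as a familiar example; the function name, the
dictionary-based table, and the values of the step size alpha
and the discount gamma are assumptions made for illustration.

```python
def q_learning_backup(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Back up one experienced transition into a tabular action-value table.

    q maps (state, action) pairs to estimated long-term reward; missing
    entries are treated as 0.0. alpha and gamma are illustrative values.
    """
    # Value of the best action currently known in the successor state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the old estimate a fraction alpha toward the one-step target.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# Hypothetical transition: the agent moved forward and received reward 1.0.
q = {}
q_learning_backup(q, "s0", "forward", 1.0, "s1", ["forward", "rotate"])
```

A greedy policy can then be read off the table by choosing, in
each state, the action with the largest stored value.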

Options  Options (Precup 2000; Sutton, Precup, & Singh 1999)
are a principled framework for temporal abstraction in RL.
Briefly, an option is roughly analogous to a subroutine: it
has an initiation set of states in which it can be invoked,
an internal policy mapping states and actions to
probabilities of execution, and a termination condition
mapping states to the probability of the option terminating
in that state. When an option is invoked it follows its
internal policy until termination; this allows an option to
be considered a temporally extended action, freeing the
agent from needing to choose an action at each step. One
option’s policy may call another option, creating an elegant
mechanism for behavioral hierarchy.
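The three components of an option can be written down
directly. The sketch below is one possible representation,
not the one used in the cited work: the class and field
names, the deterministic internal policy, and the
step(state, action) environment interface are all assumptions
made for illustration, and nested options are left out for
brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable


@dataclass
class Option:
    name: str
    initiation_set: Set[State]                  # states where the option may start
    policy: Callable[[State], Action]           # internal policy (deterministic here)
    termination_prob: Callable[[State], float]  # probability of stopping in a state


def run_option(option, state, step, rng=random, max_steps=1000):
    """Follow the option's policy until it terminates.

    step(state, action) is an assumed environment interface returning
    (next_state, reward). Returns (final_state, total_reward, duration).
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be invoked in this state")
    total, duration = 0.0, 0
    while duration < max_steps:
        state, reward = step(state, option.policy(state))
        total += reward
        duration += 1
        if rng.random() < option.termination_prob(state):
            break
    return state, total, duration
```

From the caller's point of view run_option behaves like a
single, temporally extended action: it is invoked once and
returns only when the option has terminated.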

The options framework has a solid theoretical foundation,
extending Markov decision processes to semi-Markov decision
processes (Barto & Mahadevan 2003), and two components of
the options framework are particularly important to the
algorithm presented below:

Option Models are probabilistic descriptions of the effects
of executing an option. They can be (approximately) learned
from experience, and allow stochastic planning to be
extended from primitive (one-step) actions to higher levels
of abstraction.

Intra-option Learning Methods allow the internal policies of
many options to be updated simultaneously, regardless of
which option is actually executing.
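The idea of updating many options from a single experienced
transition can be sketched as follows. This is a simplified
illustration in the spirit of intra-option learning, not the
exact update of the cited work: it assumes that each option
keeps its own action-value table, that termination
probabilities are stored per state, and that reaching a
terminating state is worth a fixed terminal_value. All names
and constants are hypothetical.

```python
def update_all_options(option_q, beta, s, a, r_ext, s_next, actions,
                       alpha=0.5, gamma=0.9, terminal_value=1.0):
    """Apply one off-policy backup to every option's own value table.

    option_q: option name -> {(state, action): value}
    beta:     option name -> {state: termination probability}
    The same transition (s, a, r_ext, s_next) is used for all options,
    whichever option (if any) actually generated it.
    """
    for name, q in option_q.items():
        b = beta[name].get(s_next, 0.0)
        # Value of continuing to follow this option from s_next.
        continue_value = max(q.get((s_next, x), 0.0) for x in actions)
        target = r_ext + gamma * (b * terminal_value + (1.0 - b) * continue_value)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (target - old)


# Hypothetical example: one push of a panel updates two options at once.
option_q = {"light_on": {}, "door_open": {}}
beta = {"light_on": {"lit": 1.0}, "door_open": {"open": 1.0}}
update_all_options(option_q, beta, "dark", "push", -0.01, "lit", ["push", "wait"])
```

In the example only the light_on option terminates in the new
state, so its table gains value from the transition, while
the door_open table records only the step cost.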

In most of the work using options, the options must be
hand-designed by the engineer in advance. It is clear that
dynamically creating and learning options is a desirable
ability, and several researchers have recently proposed
methods for doing so, e.g. (Şimşek & Barto 2004), (McGovern
2002). This work falls into that category, and is unique in
that rather than creating options tailored to a specific
task, our algorithm creates options based on intrinsic
motivation.

Layered  Learning 

Because RL has so many attractive properties, several
researchers have added RL capabilities to their robots.
However,  applying  RL  directly  over  the  robot’s  (very 
large)  sensor  space  often  leads  to  convergence  problems 
due  to  violations  of  the  Markov  assumption  and  the 
sheer  enormity  of  the  space.  One  solution  to  this  is  the 
use  of  layered  learning  to  provide  a  suitably  abstract 
problem  space  that  RL  can  solve  efficiently. 

It is now widely accepted that a layered and incremental
approach to designing robot control systems (Brooks 1986)
works well in practice, and further that the interaction of
layered, parallel control elements can produce interestingly
complex adaptive behavior (Pfeifer & Scheier 1999).

By the same token, we argue that learning elements should be
layered in a robot’s control system in the same way that
more static control elements are. There are several examples
of layering RL on top of a behavioral basis, e.g. (Huber &
Grupen 1997). The natural extension is to add additional
learning layers in between. Layered learning (Stone 2000;
Utgoff & Stracuzzi 2002) means that we can use lower-level
learning elements to learn useful structures,
discretizations, and behavior that can help make
higher-level learning feasible, and allows for the
interaction of several learning elements to generate complex
adaptive behavior.

One example of the approach described above is the layered,
distributed and asynchronous reinforcement learning model
developed by Konidaris & Hayes (2005). RL was layered on top
of a learned topological map, which itself was layered on
top of a reactive behavioral substrate on a robot to perform
puck foraging in an artificial arena. The use of a reactive
behavioral substrate created conditions under which the
topological map could be easily learned, while the
topological map served to keep the state space small and
task relevant. This, coupled with the use of asynchronous
and parallelizable updates that took advantage of the fact
that computation is very much faster than action in embodied
domains, allowed the robot’s RL element to converge in real
time, between decisions.
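To make concrete why a small topological state space permits
convergence between decisions, the following sketch runs
value-iteration sweeps over a toy map until the values stop
changing. It is not the distributed, asynchronous model of
Konidaris & Hayes (2005); the deterministic graph, the
function name, and the constants are assumptions made for
illustration.

```python
def sweep_to_convergence(values, neighbors, rewards, gamma=0.9,
                         tol=1e-6, max_sweeps=10_000):
    """Repeat in-place value backups over map nodes until values settle.

    values:    node -> current value estimate (updated in place)
    neighbors: node -> list of nodes reachable in one move
    rewards:   (node, next_node) -> reward for that move (default 0.0)
    Returns the number of sweeps performed.
    """
    for sweep in range(1, max_sweeps + 1):
        biggest_change = 0.0
        for node, nbrs in neighbors.items():
            if not nbrs:
                continue  # nodes with no outgoing moves keep their value
            new_value = max(rewards.get((node, n), 0.0) + gamma * values[n]
                            for n in nbrs)
            biggest_change = max(biggest_change, abs(new_value - values[node]))
            values[node] = new_value
        if biggest_change < tol:
            return sweep
    return max_sweeps


# Hypothetical three-node map with a reward for the move from B to C.
neighbors = {"A": ["B"], "B": ["C"], "C": []}
values = {"A": 0.0, "B": 0.0, "C": 0.0}
sweeps = sweep_to_convergence(values, neighbors, {("B", "C"): 1.0})
```

With only a handful of nodes, a few sweeps suffice, which is
cheap compared with the time a physical robot needs to
complete a single action.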

This worked well, but the learning dynamics it displayed
were those of traditional RL: a task-specific, externally
imposed reward function was used to achieve a certain
behavior, after which no additional learning took place. The
elegance of a layered architecture is that these dynamics
can be addressed at the level of the RL layer, taking the
lower behavioral and topological levels for granted¹. We
thus turn our attention now to an RL system designed to
display the task-general, open-ended learning dynamics
emphasized in the developmental robotics approach.

Intrinsically Motivated Reinforcement Learning

Barto, Singh, & Chentanez (2004) introduce a model of
intrinsically motivated reinforcement learning employing the
options framework. The model is grounded in an elaboration
of the traditional conception of RL, wherein the environment
is “factored” into an external environment and an
environment internal to the agent. It is this internal
environment which provides the reward signal to the RL
system. Note that this elaboration still allows for rewards
from the external environment: these are simply “transduced”
by the internal environment.

¹ In principle, at least. We recognize that in practice
things are rarely quite that simple.


In the traditional approach to RL the reward function is
tailored specifically to the task at hand (navigating a
maze, or winning at backgammon, for example), and crafting
this reward function can require significant ingenuity. The
notion of intrinsically motivated RL is that the critic in
the internal environment includes the agent’s motivational
system, and that this motivational system should be
sophisticated and general, and should not need to be
redesigned for each specific task the agent undertakes.
Driven by this task-general intrinsic motivation, the agent
builds up a hierarchical collection of skills, in effect
achieving broad competence in its environment. These skills
can then be applied to any specific task the agent finds
itself called upon to learn.

There are many possibilities for the source of intrinsic
motivation, including surprise, novelty (Huang & Weng 2002),
or “learning progress” (Kaplan & Oudeyer 2004). Thus far the
neuroscience of dopamine neurons (Horvitz, Stewart, & Jacobs
1997) has been the most direct inspiration for the
implemented model of intrinsic motivation (Barto, Singh, &
Chentanez 2004; Singh, Barto, & Chentanez 2004), and our
experiments here also follow that path, although we plan to
explore other sources of intrinsic motivation in future
work.

With its emphasis on task-general, self-motivated, and
hierarchical learning, intrinsically motivated RL is an
obvious choice for developmental robotics. Experience with
intrinsic motivation in hierarchical RL is still very
preliminary, and the only experimental results to date are
on an abstract grid world. This paper presents an
application of this approach to a domain intended to be a
stepping stone from the abstract environment presented in
(Barto, Singh, & Chentanez 2004) to a real robotic domain.
While still discrete and deterministic, the domain is more
“robot-like” than the prior work on intrinsically motivated
RL, and we believe that given the appropriate support from
other layers in a learning architecture (such as a learned
topological map), the approach will be adaptable to real
robotic applications.

We next describe the specifics of the intrinsically
motivated RL algorithm used, which is based very closely on
the algorithm presented by Singh, Barto, & Chentanez (2004),
and then describe an experiment illustrating its behavior.

The  Algorithm 

The algorithm for intrinsically motivated RL departs from
traditional RL mostly in its use of intrinsic reward to
learn a collection of useful skills. In many other respects
the algorithm is a combination of established algorithms for
hierarchical RL. The description here, although organized
differently, is similar to that in (Singh, Barto, &
Chentanez 2004), where further details can be found.

Saliency  Present implementations depend on the hardwired
salience of certain stimuli or events in the agent’s
environment (although it bears repeating that the larger
idea of intrinsically motivated RL does not depend on this
particular model of intrinsic motivation). For example, in
the experiments described below, changes in light or sound
intensity are considered salient. We consider such notions
of saliency to be roughly analogous to the saliency of
certain stimuli (such as the smell of food or movement of a
potential threat) that are hardwired by evolution into the
nervous systems of animals in nature. These stimuli are by
necessity specific to the animal’s ecological niche, but are
general with respect to specific settings within that niche
and with respect to specific tasks or skills the animal
undertakes.

Skills  The first time the agent experiences a given salient
event it creates and initializes an option to bring about
that event. An event option’s initiation set is initialized
to include the state just prior to the salient event, and
the termination probability for the state in which the event
occurred is initialized to one. In addition, an option model
is initialized for the event option, estimating the
probability of the option terminating in a given state with
a given cumulative reward when executed from a given state.
As the agent gains experience the option’s initiation set
grows to include states that lead to states in the current
initiation set, and whenever the agent experiences a salient
event in a novel state the termination probability of the
option for that event is set to one. The algorithm updates
the option policies and option models of all options
simultaneously using intra-option learning. Once
initialized, an option is available as an action to other
options as well as to the behavioral (top-level) policy,
which provides an elegant and natural way of building a
hierarchy of skills.
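The bookkeeping described above for creating and growing
event options can be sketched as follows. This is a literal
reading of the description in this section rather than the
published implementation; the class name, the dictionary
layout, and the event and state labels are hypothetical, and
option policies and option models are omitted.

```python
class SkillLibrary:
    """Tracks one option per salient event (policies and models omitted)."""

    def __init__(self):
        self.options = {}

    def observe_event(self, prev_state, event, state):
        """Record that `event` occurred on the transition prev_state -> state.

        Returns True if a new option was created for the event.
        """
        option = self.options.get(event)
        created = option is None
        if created:
            # First occurrence: the initiation set starts with the state
            # just prior to the salient event.
            option = {"initiation_set": {prev_state}, "termination_prob": {}}
            self.options[event] = option
        # The state in which the event occurred terminates the option.
        option["termination_prob"][state] = 1.0
        return created

    def grow_initiation_sets(self, prev_state, state):
        """Add prev_state to options whose initiation set contains state."""
        for option in self.options.values():
            if state in option["initiation_set"]:
                option["initiation_set"].add(prev_state)


library = SkillLibrary()
library.observe_event("near_panel", "light_on", "lit_room")
library.grow_initiation_sets("far_from_panel", "near_panel")
```

After these two calls the light_on option can be invoked from
both near_panel and far_from_panel, which is how a skill
gradually becomes usable from more of the state space.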

Intrinsic Reward  The implementation of intrinsic reward
associated with these salient stimuli is inspired by the
response of dopamine neurons to novelty (Horvitz, Stewart, &
Jacobs 1997). The intrinsic reward for the occurrence of a
salient event is proportional to the prediction error of
that event in the learned option model for that event. Thus
when an event first occurs, or occurs in a previously
unexperienced context, its intrinsic reward will initially
be high (it will be ‘surprising’ or ‘interesting’). While
the event option policies are updated only with respect to
the extrinsic reward signal (if any) and a (hardwired)
reward for successfully terminating an option, the
behavioral policy incorporates the intrinsic reward in its
update. Thus ‘surprise’ drives the agent to try to bring the
event about. However, as the agent repeatedly does so it
becomes better at both bringing about and predicting the
occurrence of the event. As the event becomes more
predictable it becomes less rewarding (the agent gets
‘bored’). The algorithm also naturally handles extrinsic
reward, but importantly it does not depend on it.
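The ‘surprise, then boredom’ dynamic can be illustrated with
a few lines of code. The sketch assumes that the prediction
error is one minus the probability the option model assigned
to the event, and that this probability is learned with a
simple running-average update; the scale factor and learning
rate are hypothetical values, not those used in the
experiments.

```python
def intrinsic_reward(predicted_prob, scale=1.0):
    """Reward proportional to the prediction error for a salient event."""
    if not 0.0 <= predicted_prob <= 1.0:
        raise ValueError("predicted_prob must be a probability")
    return scale * (1.0 - predicted_prob)


def update_prediction(predicted_prob, occurred, learning_rate=0.5):
    """Move the predicted probability toward what was actually observed."""
    target = 1.0 if occurred else 0.0
    return predicted_prob + learning_rate * (target - predicted_prob)


# The same event occurring four times in the same context.
prob = 0.0
rewards = []
for _ in range(4):
    rewards.append(intrinsic_reward(prob))
    prob = update_prediction(prob, occurred=True)
```

Under these assumptions the reward halves on every
repetition (1.0, 0.5, 0.25, 0.125), so a fully predicted
event eventually contributes almost nothing to the
behavioral policy's update.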

Behavior  The agent behaves with an ε-greedy policy with
respect to its behavioral action-value function. The
behavioral action-value function is learned through a
combination of Q-learning and SMDP planning, and maps states
and actions (initially only the primitive actions; options
are included as they become available) to their expected
long-term reward.
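A minimal sketch of ε-greedy selection over a set of choices
that grows as options are created is given below. The
function name, the default value of epsilon, and the action
and option labels are assumptions for illustration; entries
missing from the value table are treated as zero.

```python
import random


def epsilon_greedy(q_values, state, available, epsilon=0.1, rng=random):
    """Choose among primitive actions and any options created so far.

    With probability epsilon a uniformly random choice is made; otherwise
    a choice with the highest estimated value is returned, with ties
    broken at random.
    """
    if not available:
        raise ValueError("no actions available")
    if rng.random() < epsilon:
        return rng.choice(available)
    best = max(q_values.get((state, a), 0.0) for a in available)
    ties = [a for a in available if q_values.get((state, a), 0.0) == best]
    return rng.choice(ties)


# Hypothetical values: a learned option outranks the primitive actions.
q_values = {("dark", "forward"): 0.2, ("dark", "option:light_on"): 0.7}
choice = epsilon_greedy(q_values, "dark",
                        ["forward", "rotate", "option:light_on"], epsilon=0.0)
```

Because an option is simply appended to the list of available
choices once it exists, the selection rule itself does not
change as the skill hierarchy grows.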

An  Experiment 

We now present preliminary experimental results
demonstrating the performance of the algorithm. In this work
we assume the existence of lower layers of learning
sufficient for supporting a high-level representation of the
state and action spaces. This assumption allows us to test
our ideas in a simple gridworld, where we can focus on the
high-level behavior we wish to demonstrate. While the
layered approach successfully demonstrated by Konidaris &
Hayes (2005) and discussed above gives us cause to believe
that this temporary abstraction is justified, we also
recognize the danger of these assumptions. Future work
integrating intrinsically motivated RL into a layered
learning architecture on a physically embodied robot will
inevitably present challenges not addressed in the present
work, but we do not believe this detracts from the promise
of intrinsically motivated RL as a layer driving open-ended
learning on developing robots.

Experimental  Setup 

The gridworld used in this work (Figure 1) is an abstraction
of a ‘playworld’ environment which has actually been built
at Aibo scale by some of our colleagues for future
experimentation with these ideas on Aibo robots. The world
consists of two rooms with a door between them. There are
push-panels on the walls which, when pushed, turn the lights
on or off or open or close the door. The second room
contains a charger.

The  robot  perceives  its  location  and  orientation  (we 
assume  these  are  provided  by  the  topological  map,  for 
instance),  a  list  of  visible  objects,  light  intensity,  and 
various  sounds.  It  can  move  forward,  rotate  clockwise 
or  counterclockwise,  approach  any  object  it  can  see, 
push  a  push-panel,  or  charge  itself.  Changes  in  light 
and  sound  are  hardwired  as  salient  events. 

The robot starts at a random location in the left room, in
the dark. It can see the glowing push-panel, but it cannot
see the other push-panel or the door until it has turned on
the light by pushing the glowing push-panel. Pushing the
other push-panel will open the door, causing an alarm to
ring. When it is facing the charger it may charge itself,
which causes a bell to ring and earns an extrinsic reward. A
small extrinsic punishment is given at each time step as a
‘cost of living’. Every 250 steps the robot is ‘kidnapped’,
and the experiment is reset to initial conditions with the
robot placed in a random location in the left room.
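The event logic of this world can be summarized in a small
sketch that ignores geometry entirely (location, orientation
and visibility are reduced to preconditions). Only the
250-step reset period is taken from the description above;
the action names, the event labels, and the reward magnitudes
are assumptions, as is the simplification that charging is
possible whenever the door is open.

```python
STEP_COST = -0.01      # assumed size of the 'cost of living'
CHARGE_REWARD = 1.0    # assumed size of the extrinsic reward for charging
RESET_PERIOD = 250     # steps between 'kidnappings'


class PlayworldSketch:
    """Non-spatial abstraction of the two-room world's salient events."""

    def __init__(self):
        self.steps = 0
        self.reset()

    def reset(self):
        self.light_on = False
        self.door_open = False

    def step(self, action):
        """Return (salient_events, extrinsic_reward) for one action."""
        events = []
        reward = STEP_COST
        if action == "push_light_panel":
            self.light_on = not self.light_on
            events.append("light_on" if self.light_on else "light_off")
        elif action == "push_door_panel" and self.light_on:
            # The door panel cannot be seen, and so not pushed, in the dark.
            self.door_open = not self.door_open
            events.append("alarm" if self.door_open else "door_closed")
        elif action == "charge" and self.door_open:
            events.append("bell")
            reward += CHARGE_REWARD
        self.steps += 1
        if self.steps % RESET_PERIOD == 0:
            self.reset()  # the robot is 'kidnapped'
        return events, reward


env = PlayworldSketch()
trace = [env.step(a) for a in
         ["charge", "push_light_panel", "push_door_panel", "charge"]]
```

The trace shows the dependency structure of the task: the
first attempt to charge produces nothing, whereas charging
after the light and door events rings the bell and earns the
extrinsic reward.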

The world was designed to include objects that can be
engaged with varying levels of difficulty. Engaging the
light switch is easy, but engaging the charger requires a
number of intermediate steps. Clearly, if the robot has
already learned skills to turn the light on and open the
door, these will be of use in learning to engage the
charger.

Figure 1: The gridworld environment.

Results 

Barto, Singh, & Chentanez (2004) present results from
applying the intrinsically motivated RL algorithm in a
smaller and more abstract gridworld. They show that as
expected (and, indeed, designed), the agent gains competence
by first learning to achieve ‘easy’ salient stimuli, and
then building on these skills to achieve more ‘difficult’
stimuli. When the agent first encounters a salient stimulus
it receives high intrinsic reward, but as it learns to
predict that event the level of intrinsic reward drops,
until it encounters that event again unexpectedly.

This work is ongoing and the results we report here are
similar but still very preliminary in nature. Figure 2 shows
a record of salient event occurrences over the course of the
experiment. At the beginning the robot quickly discovers and
learns to predict turning the light on and off. Later it
learns to open and close the door, and soon after learns to
charge itself (ringing a bell), a behavior which persists
because it is extrinsically rewarding.

Figure 3 plots the number of steps (from the initiation of a
testing period) to the achievement of each of the salient
events. An initial period of exploration is visible in the
first few thousand steps (due to optimistic initial values).
At about 2500 steps the robot starts to consistently achieve
the light on and off events, and at roughly 12500 steps it
discovers how to open the door and ring the bell. A second
period of exploration ensues, and at the end of the
experiment we can see that the robot has learned to
consistently turn on the light, open the door, and ring the
bell efficiently. Note that it also learns that turning the
light off and closing the door are not worth the effort.

Discussion  and  Future  Work 

While we believe intrinsically motivated reinforcement
learning has many features that make it an appealing
approach to developmental robotics, it is clear that much
remains to be done to demonstrate the viability of the
approach. A first step is to more thoroughly demonstrate the
algorithm’s performance on the gridworld presented above.
Once that has been accomplished, we hope to adapt the
algorithm to one suitable for application on a real robot.
To do this we propose adopting the layered approach
discussed above in order to provide the intrinsically
motivated RL layer with a tractable problem space.

Figure 2: Records of intrinsic rewards for the occurrence of
salient events. The left plot shows the first 5000 steps of
the experiment, illustrating the exponential drop in
intrinsic reward as salient events become predictable. The
right plot shows the full 25000 steps of the experiment in
detail, with small intrinsic rewards for predicted salient
events indicating sustained charging behavior. The regular
occurrence of the Light On event is due to the periodic
‘reset’ of the experiment.

One of the challenges of applying RL in the real world is
the issue of efficiency. No consideration of efficiency has
been made so far with respect to intrinsically motivated RL,
and there are several obvious improvements that could be
made (e.g. eligibility traces). As discussed above, the
mismatch between computation time for an RL update and the
time it generally takes a robot to take an action changes
the efficiency dynamic dramatically, as it may be possible
to perform dynamic programming to the point of convergence
between decisions (Konidaris & Hayes 2005).

Other directions for future work are more specific to the
particular challenges of developmental robotics. One
shortcoming of the current model of intrinsic motivation is
that intrinsic reward is based on a failure to predict a
(salient) event. However, as Kaplan & Oudeyer (2003) have
demonstrated, this can lead to undesirable behavior in
environments involving areas with dynamics that are
difficult or impossible to predict. As they and others have
argued, a better approach is to note that learning is most
fruitful in a “zone of proximal development”, that is, areas
that are learnable, neither too predictable (already
learned) nor too unpredictable (impossible to learn). As
mentioned briefly above, this is just one of several sources
of intrinsic motivation we hope to explore.
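One simple way to express the preference for learnable
situations is to reward the recent decrease in prediction
error rather than the error itself. The sketch below is a
deliberate simplification and should not be read as Kaplan &
Oudeyer's formulation; the function name, the windowed
average, and the window size are assumptions.

```python
def learning_progress(errors, window=5):
    """Drop in mean prediction error between two consecutive windows.

    errors is the history of prediction errors in one situation, oldest
    first. Returns 0.0 until two full windows of history are available.
    """
    if len(errors) < 2 * window:
        return 0.0
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return sum(older) / window - sum(recent) / window


already_learned = [0.0] * 10          # fully predictable: nothing left to learn
unlearnable = [0.5] * 10              # error never improves
learnable = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

progress = [learning_progress(h)
            for h in (already_learned, unlearnable, learnable)]
```

Only the learnable history yields a positive value here (a
drop of about 0.5 in mean error); both the fully predictable
and the persistently unpredictable histories yield zero, so
neither would attract an agent rewarded in this way.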

It is also worth considering what level of primitive actions
is to be engineered and considered ‘innate’. The present
work assumes a relatively high-level behavioral basis, while
much work in the developmental robotics community
concentrates on developing lower-level sensorimotor
coordination. While it is clear that much is built-in in
nature (e.g. some mammals can walk within hours after being
born), it is also clear that learning takes place to refine
sensorimotor coordination (e.g. Berthier, Rosenstein, &
Barto 2005), and moving learning ‘down the hierarchy’
removes engineer bias (Blank, Kumar, & Meeden 2002) and
leaves more room for online adaptation. To what extent
intrinsic motivation is important for such low-level
learning is an open and important question we hope to
address in the future.

Conclusion 

This paper has discussed an algorithm for intrinsically
motivated reinforcement learning and argued that it has many
characteristics that make it appealing for developmental
robotics. Intrinsic motivation drives the system to learn
and gain broad competence in a task-general manner, and the
hierarchical RL framework provides an elegant means of
building on previous learning in an open-ended fashion.
However, while intrinsically motivated RL is well suited to
situated learning, as currently formulated it is not well
suited to learning directly in the high-dimensional,
continuous problem space that physical embodiment involves.
We thus advocated a layered approach to learning
architectures for developmental robotics, wherein the lower
layers of learning provide a tractable space for the upper
layers. Much remains to be done, but we believe these
ingredients hold great promise for developmental robot
learning.

Figure 3: Steps to achievement of each salient event, tested
every 250 steps. Achievement either occurred within 60 steps
or not within 150, hence the graphs’ upper limit. (The
vertical-oriented lines show transitions between periods of
achievement, lower, and non-achievement, higher.) The
progressive learning of increasingly complex skills can be
seen.